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STATISTICAL INFERENCE OF TRANSMISSION FIDELITY OF 
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We develop Bayesian inference methods for a recently-emerging 
type of epigenetic data to study the transmission fidelity of DNA 
methylation patterns over cell divisions. The data consist of parent- 
daughter double-stranded DNA methylation patterns with each pat- 
tern coming from a single cell and represented as an unordered pair of 
binary strings. The data are technically difficult and time-consuming 
to collect, putting a premium on an efficient inference method. Our 
aim is to estimate rates for the maintenance and de novo methyla- 
tion events that gave rise to the observed patterns, while account- 
ing for measurement error. We model data at multiple sites jointly, 
thus using whole-strand information, and considerably reduce con- 
founding between parameters. We also adopt a hierarchical structure 
that allows for variation in rates across sites without an explosion in 
the effective number of parameters. Our context-specific priors cap- 
ture the expected stationarity, or near-stationarity, of the stochastic 
process that generated the data analyzed here. This expected sta- 
tionarity is shown to greatly increase the precision of the estimation. 
Applying our model to a data set collected at the human FMR1 lo- 
cus, we find that measurement errors, generally ignored in similar 
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studies, occur at a nontrivial rate (inappropriate bisulfite conver- 
sion error: 1.6% with 80% CI: 0.9-2.3%). Accounting for these errors 
has a substantial impact on estimates of key biological parameters. 
The estimated average failure of maintenance rate and daughter de 
novo rate decline from 0.04 to 0.024 and from 0.14 to 0.07, respec- 
tively, when errors are accounted for. Our results also provide ev- 
idence that de novo events may occur on both parent and daugh- 
ter strands: the median parent and daughter de novo rates are 0.08 
(80% CI: 0.04-0.13) and 0.07 (80% CI: 0.04-0.11), respectively. 

1. Introduction. In this paper we develop statistical models and infer- 
ence methods to address an important problem in epigenetic biology: infer- 
ence of the fidelity with which DNA methylation patterns in DNA are pre- 
served over somatic cell divisions in mammals. The double-stranded DNA 
methylation data we present here have the potential to yield important bi- 
ological insights. However, due to limitations of current experimental tech- 
nologies, these data also present challenges. For example, it is difficult to 
obtain this type of data in large quantities, some key biological variables 
are unobservable, certain parameters of interest may be confounded, and 
the data are subject to measurement error at perhaps a nontrivial rate 
[Genereux et al. (2008)]. These characteristics put a premium on efficient 
inference methods that make full use of the data while dealing with com- 
plexities intrinsic to the data and to the biological problem. In this section we 
introduce, for a statistical audience, relevant biological background on DNA 
methylation and the hairpin-bisulfite PCR technique [Laird et al. (2004); 
Miner et al. (2004)] used to collect the data. We then state our aim and give 
overviews of existing methods and our new approach. 

A DNA molecule is most commonly described by its sequence of nu- 
cleotides, consisting of adenine, cytosine, guanine and thymine; or A, C, G 
and T. However, this description is incomplete in that it omits some func- 
tionally relevant features. An important example is that some nucleotides 
are methylated — that is, at some nucleotide positions a methyl group has 
been chemically attached to the DNA. Methylation is an epigenetic mecha- 
nism, such that the pattern of methylation along a DNA molecule can pro- 
foundly effect its function. Aberrant methylation plays a role in many can- 
cers [Chen and Riggs (2005); Jones and Baylin (2002); Laird (2003)] and in 
several human developmental diseases, including fragile X syndrome [Laird 
(1987); Robertson and Wolffe (2000)]. In fragile X syndrome, hypermethyla- 
tion of the FMR1 locus on the X chromosome leads to many manifestations, 
including mental retardation. A critical distinction between methylation pat- 
terns and nucleotide sequences is that, whereas the latter are generally as- 
sumed to be identical in nearly all cells of an organism, the former can vary 
considerably from cell to cell within an organism. The processes that govern 
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DNA methylation and its variability across cells are thus of considerable 
biological interest. 

In mammals, methylation of DNA occurs almost exclusively on cytosines 
(Cs) that are followed by a guanine (G), locations referred to as CpG 
sites. On average, about 70-80% of CpG sites in mammals are methylated 
[Ehrlich et al. (1982)]. A key property of DNA molecules is that they are 
double-stranded, with the two strands complementary to each other, that is, 
A pairs with T, and C with G. Hence, we also refer to CpG sites as CpG/CpG 
dyads to emphasize their double-strandedness. At a CpG/CpG dyad, methyl 
groups can be present on both strands (which we call "methylated"), on one 
strand ( "hemimethylated" ) , or on neither strand ( "unmethylated" ) . Our fo- 
cus here is on the accuracy with which the pattern of methylation on one 
strand (the parent strand) of a DNA molecule is transferred to the new 
complementary strand (the daughter strand) produced by DNA replication 
[see Supplementary Material Section 1 in Fu et al. (2009) for details]. 

The transmission process of methylation patterns is complex and imper- 
fect: cytosines are first incorporated into DNA and subsequently methylated. 
Sometimes, however, a cytosine on the daughter strand remains unmethy- 
lated even when the parent is methylated, an event we refer to as a failure of 
maintenance event. Methylation is also sometimes introduced at previously 
unmethylated locations; we refer to such events as de novo methylation 
events. Here, as in Genereux et al. (2005), we allow the possibility of de 
novo events on both parent and daughter strands. Figure 1 illustrates these 
concepts under a widely-accepted model for the transmission process [Bird 
(2002, 2007)]. 

Here we consider the problem of using double-stranded DNA methyla- 
tion patterns to estimate the rates at which failure of maintenance and 
de novo methylation events occur. We collected these double-stranded data 
using hairpin-bisulfite PCR [Laird et al. (2004)], which was modified as in 
Miner et al. (2004) to include molecular codes to authenticate each DNA 
methylation pattern, removing redundant and contaminant patterns [de- 
tails of the experimental design are in Supplementary Material Section 2 
in Fu et al. (2009)]. Several features of hairpin-bisulfite PCR are particu- 
larly relevant to statistical modeling: (i) a short "hairpin" DNA sequence 
links together complementary parent and daughter strands; (ii) linked strand 
pairs are subject to bisulfite conversion which reveals their double-stranded 
methylation patterns; and (iii) errors arise due to imperfections in the bisul- 
fite conversion process [Genereux et al. (2008)]. Thus, this method yields 
data, subject to measurement error due to bisulfite conversion, on methyla- 
tion patterns for parent-daughter pairs from individual molecules. Current 
experimental technologies, however, cannot determine strand type; that is, 
which strand is the parent and which the daughter. 
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Data on double-stranded methylation patterns obtained by hairpin-bisulfite 
PCR have previously been analyzed by Laird et al. (2004) and Genereux et al. 
(2005). The analysis in Laird et al. (2004), which is not explicitly likelihood- 
based, involves counting the number of events of each type at each site over 
strands and then averaging the counts over the two possible assignments of 
strand identity, assuming that de novo events occur only on the daughter 
strand. Genereux et al. (2005) assumed strict stationarity of the stochastic 
process that generates the data and based their analysis on a likelihood 
function for individual CpG sites without incorporating information about 
which observations at different sites in a double-stranded molecule are on 
the same strand, and which are on different strands. These existing analyses 
provide the foundations for our work here. 

Here we develop a full statistical model for the data, exploiting informa- 
tion from contiguous sites rather than from individual sites alone. Three 
additional innovations of our modeling approach follows: (i) account- 

ing for measurement errors, which are due to imperfections in the bisulfite 
conversion process; (ii) relaxing the strict stationarity assumption made in 
Genereux et al. (2005); and (iii) using a hierarchical structure to allow rates 
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Fig. 1. The transmission process of DNA methylation patterns in mammalian somatic 
cells. The two strands in a DNA molecule each become parent strands during DNA repli- 
cation (from left column to middle column), used as a template to synthesize a daughter 
strand (red lines). During the short, intermediate stage (middle column), daughter strands 
are completely unmethylated, whereas parent strands have the same methylation patterns 
as before replication. Subsequently, methyl groups are added to cytosmes (right column). 
Failure of maintenance and de novo methylation events can occur, leading to differences in 
methylation patterns on parent and daughter strands. Binary vectors P; , Qi and Di, where 
i = 1,2, denote methylation patterns on a pre-replication parent strand, on a post-replica- 
tion parent strand and on a daughter strand, respectively. 
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of key parameters to vary across sites without greatly increasing the effective 
size of the parameter space. 

2. Models and methods. 

2.1. Basic model and key assumptions. We consider data collected using 
hairpin-bisulfite PCR, on methylation states at S CpG sites on N double- 
stranded DNA molecules. We denote an unmethylated CpG site by 0, and 
a methylated CpG site by 1, so the data are ./V pairs of binary strings, 
{xi,yi}, . . . , {xtv, Yn}, each string being of length S. Current technologies 
are not able to identify strand type; that is, we do not know which data 
vector (xj or y,) arose from the parent strand and which from the daughter. 
Hence, we use {•} to represent this lack of ordering in each observed pair. 
We initially assume that the data are observed without error and then relax 
this assumption. 

Our model introduces latent random variables Qj and Dj, each a bi- 
nary vector, representing patterns of methylation on the parent strand and 
daughter strand, respectively. These binary vectors may be thought of as 
potentially-imperfect copies of the patterns of methylation on the unob- 
served pre-replication parent strand, which we denote by binary vector Pj 
(Figure 1). Differences between Pj and Dj can arise due to failure of main- 
tenance, or de novo methylation on the daughter strand; differences between 
Pi and Qj can arise due to de novo methylation on the parent strand. We 
assume that these three types of events occur independently of one another, 
and independently across individuals and across sites. Denoting the prob- 
abilities of these events at site j by 1 — fij , 5dj and 8 P - , respectively, and 
assuming no spontaneous loss of methylation on the parent strand (explained 
below), we have 

(2.1) Pi(Dij = 0\Pij = 1) = 1 — fij (failure of maintenance), 

(2.2) Pr(Qij = l\Pij = 0) = 5 P - (de novo methylation on parent), 

(2.3) Pr(Dij = l\Pij = 0) = 5dj (de novo methylation on daughter). 

We are interested in estimating failure of maintenance and de novo methy- 
lation rates at CpG sites and assessing their variability across sites. We use 
A = {fJ>j,5 P j,5dj,j = 1, • • • , S} to denote the vector of parameters. 

To derive the likelihood function for those parameters, we make three 
key assumptions. The first assumption is that there is no active removal of 
methyl groups on the parent strand. That is, if the parent strand is methy- 
lated before replication, then it will also be methylated after replication: 
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Although recent publications, such as Metivier et al. (2008) and 
Kangaspeska et al. (2008), suggest the possibility that transcriptionally ac- 
tive loci can have very rapid changes in methylation patterns which may be 
due to active removal of methyl groups from the template DNA, there is no 
evidence so far that this active removal occurs at inactive loci in leukocytes, 
the locus type and the cell type from which our data were collected. This 
assumption is also consistent with what underlies the models in Laird et al. 
(2004) and Genereux et al. (2005). The conditional probability in (2.4) then 
joins those in (2.1)-(2.3) to form a complete probabilistic characterization 
of the transmission process at a single CpG site. 

The second assumption is that methylation events occur independently of 
one another across sites. Equations (2.1)-(2.4), together with this assump- 
tion, determine the conditional distribution of the ordered pair (Qj,Dj) 
given Pj, which we denote h\ (Table 1). To complete the specification of 
the distribution of (Qij,Dij), we further model P^s as independent Bernoulli 
random variables with methylation probability mf 

(2.5) Pr(P ij = l) = rn j . 

Under this second assumption, we obtain the likelihood function for a single 
double-stranded methylation pattern with known strand type as the product 
of probabilities of methylation patterns at individual sites, each probability 
summing over two possibilities of the methylation status (represented by pij) 
on the pre-replication parent strand Pj. Specifically, we give the likelihood 
for the case where Xj contains data from the parent strand and contains 
data from the daughter strand: 

Pr((Q i ,D i ) = (x i ,y i )|A) 

= J[ ^2 h x (xi j ,y ij ]pi j )m p j i: >(l -rrij) 1 Vi i . 
j=lpij=0 

Since strand type is unobserved, to get the probability of the observed 
double-stranded methylation pattern i, we must sum over the two possible 
assignments of strand type: 

Pr({Q J ,D i } = {x i ,y i }|A) 

(2 - 7) i< , 

-J {Pr((Q i ,D i ) = (xi, yi )|A) +Pr((Q i ,D i ) = (y^x^A)}, 



where 1(A) is the indicator function, taking value 1 if condition A is true 
and otherwise. 
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Table 1 






Probabilities 


of methylation events at site j , 




h\(qij,dij;pn) = 


= Pr((Oii,D«) = 


(qij , dij ) | Pij — pij ) 




dij ) Pij = Pi j 


h\{qij,d i:j ;pij) 


Methylation event 


(0, 0) 


1 





fAssumed not to occur 


(0, i) 


1 





fAssumed not to occur 


(1,0) 


1 


1 - N 


Failure of maintenance 


(1, 1) 


1 




Maintenance 


(0, 0) 





(l-5 pj )(l-S dj ) 


No de novo on parent or daughter 


(0, i) 





(1 - S pj )S dj 


De novo on daughter but not parent 


(1,0) 





S Pj (l - S dj ) 


De novo on parent but not daughter 


(1, 1) 





Spj Sdj 


De novo on parent and daughter 



The dagger f indicates cases not possible under the assumption of no active removal of 
methyl groups on the parent strand. 



By making the third assumption that data from the N double-stranded 
methylation patterns are independent draws from the same distribution with 
parameter A, we then obtain a likelihood function of A for all N patterns: 

N 

(2.8) L(A; {x, y}) = ]J Pr({Q i5 D,} = {x,, A) 

i=l 
N 

(2.9) oc []{Pr((Q i ,D i ) = (xi, yi )|A) + Pr((Q i ,D i ) = (y^x^A)}. 

i=l 

2.2. Incorporating measurement error and estimating error rates. As men- 
tioned in Section 1, imperfection in bisulfite conversion is an important 
source of potential measurement error here and in other applications in- 
volving bisulfite conversion. In brief, bisulfite conversion is an experimental 
technique that aims to convert unmethylated cytosines to a different base, 
uracil, thus allowing unmethylated and methylated locations to be iden- 
tified by DNA sequencing. Imperfections during this process can lead to 
two types of error: failure of conversion, where bisulfite fails to convert an 
unmethylated cytosine (resulting in a truly unmethylated site being mea- 
sured as methylated) and inappropriate conversion, where bisulfite converts 
a methylated cytosine to a thymine (leading to a truly methylated site be- 
ing measured as unmethylated). We let b = (b±, . . . , bs) and c = (ci, . . . , cs) 
denote the respective rates at which these two types of errors occur, where 
the elements bj and Cj represent the error rates at site j. 

To incorporate these measurement errors into the model, we introduce 
random variables Q'^ and D'- to denote the observed methylation states 
on the post-replication parent strand and the daughter strand, while con- 
tinuing to use Qij and to denote true methylation states on those two 
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Table 2 

Rates of bisulfite conversion error, which are conditional 
probabilities of the observed methylation state being different from 

a given true methylation state. Specifically, bj is the failure of 
conversion rate at site j and Cj the inappropriate conversion rate 







Observed (Q< 3 or£)y) 






1 


Truth (Qij or Dij) 





1-bj bj 




1 


c j 1 - C J 



strands. We assume that errors occur independently across CpG sites and 
DNA strands, so that the conditional distribution of the observed data given 
the true states is given by 

Pr((Q;,£K.) = (xij^KQij, Ai)) 

(2.10) 

= Pr (<2ij =Xij\Qij)Pr(Di j = y ij \D ij ), 

where each term on the right-hand side is a function of bisulfite conversion 
error rates bj and Cj as in Table 2. 

We extend the parameter vector A to incorporate these measurement error 
parameters, A = {/i, 6 p , 5d, b, c}. The likelihood function, allowing for mea- 
surement error, becomes 

L(A;{x,y}) 

(2.11) 

N 

(x [J{Pr((Q< J D , i ) = (Xi, yi )|A) + PrtfC&Di) = (y 4 ,x,)|A)}, 

i=l 

where 

Pr((Q / ,,D0 = (x,,y i )|A) 
s 1 1 

(2.12) = II E E p *((QiP D 'n) = (?ij,ViMQii> D n) = fei^y)) 

j=lq t j=0dij=0 

1 

x ^2 hx(qij,dij;pij)m^ 3 (l -m j ) 1 ~' Pij , 

Pij=0 

and Pr((Q^,D^) = (y,;,Xj)|A) is defined similarly. 

2.3. Hierarchical model for variability in rates across sites. In the above 
formulation we have allowed that rates may take different values across sites 
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j = 1, . . . , S. In practice, there is limited information about the rates at any 
given site, so attempting to estimate each of these parameters separately will 
produce highly variable estimates. To overcome this challenge, we employ a 
hierarchical model to effectively reduce the dimensionality of the parameter 
space and to borrow strength across sites. In this hierarchical model we 
assume that the components of the vectors /x, 5 P , 5 d , m and c each follow a 
beta distribution. 

In specifying these beta distributions, we find it convenient to use the pa- 
rameterization Beta(r, g) to denote the beta distribution with mean r and 
variance gr{l — r), hence referring to the parameter g as the "scaled" vari- 
ance. The relationship between this parametrization and the conventional 
a-/3 parametrization is 



We prefer the r-g parameterization in our analysis because (i) r and g 
are easily interpretable; and (ii) this parametrization facilitates specification 
of sensible priors — in particular, it is reasonable to assume r and g to be 
independent a priori. 

Our hierarchical model assumes a separate set of r and g for each of the 
vectors fx, 5 P , 5 d , b and c: 



The methylation probability vector, m, is dealt with slightly differently, 
as described in the next section. 

2.4. Incorporating stationarity. Previous analyses of these types of data 
[Genereux et al. (2005)] have been based on the assumption that the trans- 
mission process has attained temporal stationarity; that is, at each site the 
proportion of methylated CpGs is stable over generations of cell division. 
Supporting biological evidence for this assumption comes from observations 
that methylation densities at the FMR1 locus were virtually unchanged 
over a five-year time span in several human males with fragile X syndrome 
[Stoger et al. (1997)]. 






for a Beta(a, (3) random variable X with density 
(2.14) f(x)<xx a - 1 {l-xf- 1 . 



(2.15) 
(2.16) 
(2.17) 
(2.18) 
(2.19) 



fij -Beta^,^), 
5 Pj ~ Bet&(r dp ,g dp ) 
b~dj ~ ~Beta,(r dd ,g dd ) 
bj ~ Bet&(r b ,g b ), 
cj ~ Beta(r c ,g c ). 
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The assumption of stationarity in Genereux et al. (2005) imposes the fol- 
lowing strict relationship between the methylation probability mj and the 
failure of maintenance and de novo methylation rates: 

(2.20) m- 



1 + 8 Pj + d,ij - fij 

Requiring strict equality in this equation appears to be a rather strong 
assumption. Indeed, examples in Fu (2008) illustrate the strong effect this 
assumption can have on the likelihood surface. 

Thus, to avoid making this strong assumption, and to improve robust- 
ness to departures from strict stationarity, we exploit the flexibility of the 
Bayesian modeling approach to allow for deviations from strict equality in 

(2.20) . 

Specifically, to incorporate stationarity, we assume that each mj follows 
a beta distribution, 

(2.21) mj ~ Beta,(r mj , g m ) , 
with mean parameter 

S Pj + 5 dj 



mj 1 + S p . + S d j - fij 



(2.22) 

This distribution on mj is centered on its expected value under the station- 
arity assumption, but allows for deviations, measured by g m : at a CpG site 
small values of g m represent near-stationarity, whereas large values indicate 
substantial deviations from stationarity. 



2.5. Bayesian inference and choice of priors. We choose to use a Bayesian 
approach to fit the hierarchical model, specifying priors for the values of 
mean r and scaled variance g in beta distributions (2.15)-(2.19). 

We assign an independent uniform prior to each r: a Uniform(0, 1) prior 
for each of r M , rg p and rg d , and a Uniform(0, 0.06) prior for r c because ex- 
perimental results suggest that measurement error rate Cj is likely to be 
below 0.06. We can use a similar method to estimate r& (and for the 
other error rate bj, although in our data analysis bj is fixed to an estimate 
obtained from experiments (see Section 3). 

We assign a Uniform(— 4, 0) prior to each log 10 g. This choice of prior has 
the flexibility of capturing a wide range of beta distributions with qualita- 
tively different levels of variability. Figure 2 illustrates this point: for a fixed 
mean value r, as log 10 g increases, the beta distribution becomes more and 
more spread out over the support (0, 1). In other words, this choice of prior 
on log 10 <7 allows us to model cases ranging from little variation (top row 
in Figure 2) to where a few sites have very different rates from the other 
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Fig. 2. Shape of a beta distribution changes with respect to scaled variance g. Data were 
simulated for beta distributions with mean r — 0.05 and different values of scaled variance 
g. Ranges of the horizontal and the vertical axes are different between rows. As g increases, 
the histogram spreads out to the entire support of (0, 1) and a second peak at 1 starts to 
appear (bottom right panel). 



sites (for example, bottom right plot in Figure 2). In Table 3 we provide 
guidelines on the interpretation of log 10 g. 

We fit the model using Markov chain Monte Carlo (MCMC) methods 
[Supplementary Material Section 3 in Fu et al. (2009)]. To check the reli- 
ability of the output of these methods, we applied the algorithm to many 
simulated data sets, and also confirmed that point estimates of parameters 
from simpler versions of the model we present here agreed closely with maxi- 
mum likelihood estimates obtained from an expectation-maximization (EM) 
algorithm. See Fu (2008) for further details. 



Table 3 
Guidelines on the interpretation 
of the scaled variance g on the 
log 10 scale 

log 10 <7 Variability 

<— 3 Very low 

—3 to —2 Low 

-2 to -1 Medium 

>-l High 
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2.6. Origins of data. We collected DNA methylation patterns from the 
FMR1 locus (see Section 1 for associated disease) on the X chromosome 
in leukocytes using hairpin-bisulfite PCR [experimental conditions as in 
Laird et al. (2004) and Miner et al. (2004); also briefly discussed in the 
Supplementary Material Section 2 in Fu et al. (2009)]. Due to cell-cell vari- 
ation, double-stranded methylation patterns were collected from multiple 
cells in each individual sampled. The data analyzed here contain 169 double- 
stranded methylation patterns, each from a single cell, from 6 independent 
normal females (15-33 cells or patterns per individual) at 22 CpG sites 
(chrX: 146800867-146801008) in the promoter region of the FMR1 locus. 
Each female cell has two X chromosomes: one is hypermethylated, hence 
primarily inactive, and the other hypomethylated, and hence primarily ac- 
tive. The data presented here are from the hypermethylated FMR1 locus 
on the inactive X chromosome in each cell sampled. Although this data set 
may be considered to be limited in size, the data are unusual in their double- 
strandedness compared to the single-stranded methylation data commonly 
produced from high-throughput technologies. A small subset of these data, 
which contains 33 methylation patterns at 7 CpG sites, was published in 
Genereux et al. (2005). 

3. Results. We applied our model to the FMR1 data described above. 
Since these data were collected from the primarily hypermethylated (hence 
inactivated) X chromosome in normal females, the methylation density is 
high as expected: 81.9% of all CpG dyads are methylated on both strands, 
6.4% are methylated on just one strand, and 11.7% are unmethylated on 
both strands [see Section 1 and [Genereux et al. (2005)] for previous analysis 
of a subset of the data]. 

Here we treat those double-stranded methylation patterns from the six 
different individuals as independent samples from a single, homogeneous 
population of methylation patterns. This treatment is effectively equivalent 
to assuming no variations in mj, fij, 5 P - and S^j across the individuals. 
This seems to be a reasonable starting point, given the current absence of 
evidence for notable variations in at least some of these parameters among 
individuals [[Stoger et al. (1997)]]. In a more elaborate analysis, however, we 
could relax this assumption by incorporating variability across individuals 
into our hierarchical model. Furthermore, our model does not distinguish be- 
tween methylation patterns from X chromosomes inherited from the mother 
and those inherited from the father. Information on the parental origins of 
a given methylation pattern is not available for our FMR1 data. 

The failure of bisulfite conversion rate, b, is relatively straightforward 
to estimate directly for the methylation patterns analyzed [Supplementary 
Material Section 2 in [Fu et al. (2009)]]. We estimated b to be 0.003 for our 
FMR1 data [[Laird et al. (2004)]] and in the analysis here fixed it to be 
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constant across sites. By comparison, the inappropriate conversion rate c is 
harder to obtain directly. We estimate this rate in our data analysis. 

We performed three independent runs of our MCMC fitting procedure, 
each from a different starting point: two runs of 1.44 million iterations, and 
a third run of 2.88 M iterations (total compute time ~ 160 hours on a 2.4 
GHz CPU). We sampled each MCMC run every 2 K iterations (or 4 K for 
the third run) after discarding the initial 20% of each run as burn-in. Trace 
plots displaying MCMC samples versus iterations (not shown) provided no 
indication of poor mixing. Histograms of key parameters from different runs 
(not shown) also agreed closely with one another. Results below come from 
pooling the samples from the three runs. These long runs were carried out 
to ensure convergence and may have exceeded necessity; in fact, we achieved 
similar results from much shorter pilot runs of 50 K iterations. When us- 
ing credible intervals to summarize posterior distributions, we provide 80% 
coverage, which is not unduly influenced by long tails of the distributions. 

3.1. Rate of measurement error due to inappropriate bisulfite conversion, 
and its variability. The FMR1 data provide strong evidence for the occur- 
rence of inappropriate conversion error: the posterior distribution for the 
mean error rate r c across CpG sites is centered on 0.016, with 80% credible 
interval (CI) of (0.009, 0.023) (top histogram in Figure 3), and there is little 
probability mass very near 0. The posterior distribution for scaled variance 
g c is concentrated on small values, suggesting that the error rate c does 
not vary greatly across CpG sites (Figure 4A), in accord with experimental 
findings [[Genereux et al. (2008)]]. 

3.2. Failure of maintenance rate and its variability. We estimate the 
mean failure of maintenance rate 1 — across CpG sites to be 0.024 (80% 
CI: 0.017-0.031; side histogram in Figure 3). MCMC samples of 1 — and r c 
show a striking linear relationship (Figure 3), suggesting a degree of uniden- 
tifiability in these parameters. This relationship turns out to conform very 
closely to predictions under a much simpler analysis based on summarizing 
the FMR1 data by the overall proportions of methylated, hemimethylated 
and unmethylated sites (pm,Ph,Pu) = (0.82,0.064,0.116) [red line in Figure 
3; see the analysis in the Supplementary Material Section 4.1 in [Fu et al. 
(2009)]]. This agreement between two very different analysis approaches sug- 
gests the robustness of the inference that 1 — r„ and r c lie close to this line. 
The fact that our MCMC samples are concentrated on only part of this line 
reflects the additional information we are able to extract from the full data 
by making more detailed modeling assumptions as stated in Section 2.1. 
The inference, of course, must then be less robust to deviations from these 
assumptions. 
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Fig. 3. Posterior distributions and scatter plot of mean failure of maintenance rate, 
1 — r M , and mean error rate, r c , under the multi-site model for the FMR1 data. The red 
line, 1 — fj, = 1.04c + 0.04, indicates a predicted relationship for these estimates under a 
much simpler analysis (see Section 3.2). 

Regarding variability of 1 — fx across CpG sites, the data suggest that this 
variability is low, since the posterior for is concentrated around small 
values (Figure 4B). 

3.3. De novo methylation rates and their variability. Our results suggest 
that de novo rates may be substantially larger than failure of maintenance 
rates (which can happen even at stationarity): the posterior distribution for 
the median parent and daughter de novo rates are centered on 0.08 and 0.07, 
respectively, with very low probability mass near (histograms in Figure 
5; compare with Figure 3). These high rate estimates are consistent with, 
and may partly explain, the high overall methylation rates in this genomic 
region. There is, however, considerable uncertainty in these estimates: 80% 
CIs are 0.04-0.13 and 0.04-0.11, respectively (histograms in Figure 5). Note 
that the scatter plot shows that these two parameters are not independent, 
a posteriori: in particular, it is unlikely that both de novo rates are at the 
upper end of these CIs (no MCMC sample in the scatter plot in Figure 5 
has both rates > 0.13). 

One biological question of interest is whether or not de novo events occur 
on both parent and daughter strands. We do not conduct a formal test of 
hypotheses here, but we note that the posterior distribution of the median of 
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Fig. 4. Posterior distributions oflog 1Q g for the FMR1 data, g in (A)-(D) is the scaled 
variance in the beta distribution assumed for (A) measurement error rate c ( due to inap- 
propriate bisulfite conversion); (B) failure of maintenance rate 1 — fi; (C) parent de novo 
rate S p ; and (D) daughter de novo rate &d- In (E), g m reflects deviation from the station- 
arity assumption. See Table 3 for guidelines on the interpretation of values oflog 10 g. The 
y-axis in (C) has a wide range. 



each de novo rate has little probability mass near (Figure 5), in contrast 
to the prior distribution, providing informal support for both parent and 
daughter de novo events occurring. 

Regarding variability across sites, the data are uninformative for variabil- 
ity in the daughter de novo rate: the posterior for log 10 <7dd is hat over the 
whole support (Figure 4D). In contrast, the parent de novo rate 8 P may 
vary considerably across sites: log 10 (7dp is concentrated on large values (see 
Figure 4C and compare with the bottom right panel in Figure 2). Further- 
more, a few outlying sites have possibly high rates (Figure 6B, in contrast 
to little variation in 1 — \x in Figure 6A and in 5d in Figure 6C), which may 
have a strong influence on the mean value across sites. This large variability 
makes it difficult to estimate mean de novo rates and renders them mislead- 
ing in summarizing site-specific 5 p s. Therefore, we have chosen to report the 
median de novo rates. 

The observation that 5 P may vary considerably across sites brings into 
question the suitability of our assumption of a single beta distribution for 
these rates, since this assumption has limited flexibility in dealing with 
potential outliers. To examine this issue, we modified our model to allow 
the de novo rate parameters to follow a mixture of two beta distributions, 
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Fig. 5. Posterior distributions and scatter plot of median de novo rates 5 P and 8d under 
the multi-site model for the FMR1 data. Smooth curves in the histograms are density 
functions of the median of 22 beta random variables, each corresponding to a parent (or 
daughter) de novo rate at a CpG site. The two density curves are identical because the 
prior distributions for the rates are identical. 



where the component corresponding to the outlying sites was assumed to be 
Uniform(0, 1) [that is, Beta(a = 1,(3= 1)]. Analyses using this model con- 
tinued to suggest that some sites (specifically sites 10, 14, 15 and 16) may 
have substantially higher parent de novo rates than others (see [Fu (2008)] 
for further details). Indeed, the data at these four sites are characterized by 
particularly small numbers of unmethylated CpG dyads (0 at site 16, 1 at 
sites 10 and 14, and 3 at site 15, in contrast to the median of 20 at other 
sites). 

Our analysis differs from previous analyses by accounting for measurement 
errors which have rate c. To gain insight into how incorporating error rates 
affects estimated de novo rates, we examined the joint posterior distribution 
of c and the average de novo rate (Figure 7). As in the analagous plot for 
failure of maintenance rate (Figure 3), posterior samples here also lie close to 
a line, which is in close agreement with a simple analysis based on summary 
statistics [Supplementary Material Section 4.1 in [Fu et al. (2009)]]. Remarks 
made above in Section 3.2 regarding robustness of the conclusions apply 
equally here. 

Another important novel contribution of our analysis is that, by modeling 
the strand information in multi-site data, we can distinguish, at least in prin- 
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Fig. 6. Median and 80% credible intervals of CpG- site- specific estimates of (A) failure 
of maintenance rates 1 — \x, (B) parent de novo rates 8 P and (C) daughter de novo rates 
3d under the multi-site model for the FMR1 data. The numbering of the CpG sites follows 
the convention established in [Stoger et al. (1997)]. 



ciple, between the two different types of de novo events. This novel feature 
makes it possible to draw several conclusions mentioned above, particularly 
that the data support the occurrence of both parent and daughter de novo 
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Fig. 7. Joint posterior distribution of the median of the average of the parent and daugh- 
ter de novo rates, (S p + 5d)/2, and the median of the error rate c for the FMR1 data. The 
red curve, (S p + 84) /2 = 0.44 + 0.05/(c — 0.15), indicates a predicted relationship for these 
estimates under a much simpler analysis (see text). 
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events, and that the data provide different information on the variability of 
5 P and 84- However, due to the relative complexity of our model, it is dif- 
ficult to identify the source of the information that distinguishes daughter 
de novo events from parent de novo events. To gain insight, we examined in 
some detail the multi-site likelihood for a single methylation pattern [Sup- 
plementary Material Section 4.2 in [Fu et al. (2009)]]. A conclusion from 
this investigation is that, assuming stationarity (or, in fact, under weaker 
assumptions), data on methylation patterns where one strand is much more 
methylated than the other will tend to favor large estimates of 8 P relative 
to 8d. Additionally, the more methylated strand will tend to be the parent 
strand. Thus, an intuitive explanation of our inference that sites 10, 14, 15 
and 16 have large S p is that some patterns, with large differences in the 
methylation density on the two strands, are hemimethylated at these sites 
(with the methylated CpG more likely to be on the overall more methylated 
strand). The novel insights into the de novo rates we gain here are further 
discussed in Section 4. 

3.4. Stationarity. To examine the extent to which the data are con- 
sistent with a stationary model, we consider the posterior distribution of 
log 10 <7 m , which reflects deviations from stationarity (Figure 4E). This pos- 
terior largely follows the uniform prior, except that large values are excluded. 
We conclude that the data do not exhibit large deviations from stationar- 
ity, although they do not provide strong support for the strict stationarity 
assumption either. 

3.5. Impact of bisulfite conversion errors on the estimation of failure of 
maintenance and de novo methylation rates. Our analyses above incorpo- 
rate both types of bisulfite conversion errors, which have not been accounted 
for in previous analyses of methylation patterns [see, e.g., [Genereux et al. 
(2005)]; [Laird et al. (2004)]; [Ushijima et al. (2003)]; [Pfeifer et al. (1990)]]. 
It seems possible that our incorporation of measurement error may be the 
main reason for discrepancies between our estimates of rates of methylation 
events and estimates from these previous analyses. To assess this, we reran 
the multi-site model on the FMR1 data, setting the two measurement error 
rates b and c to be 0, which corresponds to ignoring both types of bisul- 
fite conversion errors. We carried out three independent runs from different 
starting points. Each run consisted of 1.44 million iterations, including 20% 
burn-in, and took about 38 hours. These runs gave consistent results, so we 
pooled the three runs to produce the posterior distributions. 

Our results show that incorporating measurement errors indeed has sub- 
stantial effects on the inference of failure of maintenance rate 1 — \x and 
daughter de novo rate 8^ (Figure 8A and C) but little effects on parent 
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Fig. 8. Impact of measurement errors (due mainly to inappropriate bisulfite conversion 
error) on the inference of the rates of methylation events. Solid lines incorporate these 
errors, whereas dashed lines do not. From top to bottom are posterior densities of medians 
of (A) failure of maintenance rate 1 — fi, (B) parent de novo rate S p and (C) daughter de 
novo rate 8d- Vertical bars indicate the 80% credible interval (10% and 90%o percentiles) 
for each density. 



de novo rate 5 P (Figure 8B). Estimates of these two rates under the no- 
error model are largely consistent with previous results [[Laird et al. (2004)]; 
[Genereux et al. (2005)]]. This comparison suggests that whether or not mea- 
surement error is accounted for may have been an important factor in pro- 
ducing different inferences. 

4. Discussion and conclusions. We have developed a statistical model for 
double-stranded DNA methylation patterns to investigate a central problem 
in epigenetic biology: the transmission fidelity of DNA methylation patterns 
in somatic mammalian cells. Our modeling approach addresses several chal- 
lenges that are inherent in these data and that have not been approached by 
previous methods. Key innovations of our model include the incorporation 
of measurement error and the incorporation of available "phase" informa- 
tion; that is, which hemimethylated CpG dyads are methylated on the same 
strand, by examining multiple sites simultaneously. The first innovation is 
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important because, as we have shown, measurement error has a substantial 
effect on estimates of fidelity rates. The second is important because it al- 
lows us both to separately estimate parent and daughter de novo rates, and 
to relax the strict stationarity assumption that underlies most existing ap- 
proaches [see, for example, [Otto and Walbot (1990)]; [Pfeifer et al. (1990)]; 
[Genereux et al. (2005)]]. 

By applying our new model to the FMR1 data, we gained several new 
insights into methylation transmission fidelity rates. Below we summarize 
our findings and compare them with other studies: 

1. Inappropriate bisulfite conversion can be a significant source of measure- 
ment error. We estimated the mean rate of this error in our data set to be 
0.016 (80% CI: 0.009-0.023). As far as we are aware, ours is the first esti- 
mate of this inappropriate conversion rate obtained from genomic methy- 
lation pattern data that are double-stranded and molecularly-validated 
(see Section 1 for detail on data collection). Our estimate of this rate 
is lower than that obtained by [Genereux et al. (2008)] using synthetic 
oligonucleotides (average rate 0.035; 95% confidence interval: 0.027-0.049). 
This difference may derive, in part, from the different lengths of the DNAs 
used in the two experiments [[Genereux et al. (2008)]]. 

2. We estimated the mean maintenance rate fi to be 0.976 (80% CI: 0.969- 
0.983). This is higher than previous estimates for similar data [[Laird et al. 
(2004)]; [Genereux et al. (2005)]], which can be mostly explained by the 
fact that these previous analyses did not account for bisulfite conversion 
errors. On the other hand, Pfeifer et al. (1990) estimated the mainte- 
nance rate to be much higher (about 0.999), which is mainly due to a 
high overall methylation density (~0.98) at the site analyzed. 

3. We found suggestive evidence that de novo events occur on both par- 
ent and daughter strands, in that posterior distributions for both parent 
and daughter de novo rates have little probability mass near 0. Previous 
empirical studies have asked whether de novo events can occur on the par- 
ent strand [[Kappler (1970)]; [Adams (1971)]; [Schneiderman and Billen 
(1973)]; [Bird (1978)]], yielding conflicting conclusions for different cell 
types. Recent analyses still could not address this question because phase 
information was either not available [[Pfeifer et al. (1990)]; [Ushijima et al. 

(2003) ]] or not incorporated in their models [[Laird et al. (2004)]; 
[Genereux et al. (2005)]]. To accommodate those limitations, [Pfeifer et al. 
(1990)] estimated the total de novo rate as a whole, whereas [Laird et al. 

(2004) ] and [Genereux et al. (2005)] imposed additional constraints that 
are equivalent to estimating the total de novo rates. Potential implica- 
tions of a positive parent de novo rate are discussed in [Genereux (2009)]. 

4. We also found some evidence that parent de novo rates vary considerably 
across sites. In particular, sites 10, 14, 15 and 16 in our data may expe- 
rience unusually high parent de novo rates. Analyses of the data at these 
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sites individually using the single-site approach from [Genereux et al. 
(2005)] also suggested potentially large values for the total de novo rate 
at these sites, although the single-site approach was unable to separately 
estimate the de novo rates on parent and daughter strands. 

Some previous studies estimated an overall methylation transmission fi- 
delity rate, tracking methylation patterns over one [[Bird (1978)]] or more 
[[Ushijima et al. (2003)]] rounds of DNA replication. Different experimen- 
tal techniques and sampling procedures used in these studies led to data of 
very different types from that of our FMR1 data. A fair comparison of the 
results is difficult to carry out because of these differences and is therefore 
not addressed here. 

A limitation of our model is the assumption that methylation events oc- 
cur independently across CpG sites. This assumption does not seem to hold 
in practice, especially for maintenance events, in light of current research on 
methylation enzymes [[Vilkaitis et al. (2005)]; [Goyal, Reinhardt and Jeltsch 
(2006)]]. It is therefore of great interest to study the dependence structure. 
In separate work we developed statistical models to incorporate the depen- 
dence [[Fu (2008)]]. Our preliminary results there yielded similar estimates 
of at least the mean rates (parameter r) of the methylation events to the 
estimates in this paper. 

As more hairpin-bisulfite PCR data become available, the new statistical 
analysis methods described here may continue to provide novel biological 
insights into epigenetic fidelity. The estimation precision will improve as 
new experimental protocols yield data with lower measurement error rates 
[[Genereux et al. (2008)]] and lead to better estimates of de novo methy- 
lation rates. With our statistical methods one can investigate differences 
among fidelity rates in different genomic regions. For example, our model 
can be applied also to sparsely methylated CpG islands where de novo rates 
may take on a wider range of values than in densely methylated regions 
[[Ushijima et al. (2003)]; [Laird et al. (2004)]]. Furthermore, relaxation of 
the stationarity condition gives our methods great flexibility to examine the 
transmission of methylation patterns in cases where methylation densities 
are dynamic rather than stationary. Many of these cases have important 
clinical and pharmaceutical implications; they include early developmen- 
tal stages characterized by loss and re-establishment of methylation pat- 
terns [[Reik, Dean and Walter (2001)]], aging during which methylation pat- 
terns may change over time in at least certain cell types [[Wilson and Jones 
(1983)]], and in several types of cancer in which methylation patterns change 
rapidly over cell generations [[Foster et al. (1998)]]. These cases will pose 
new challenges. For instance, sets of methylation patterns collected from 
cancer patients may be sampled from a mixture of cancer cells and nor- 
mal cells. Successful analysis of such data must account for the existence of 
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these subpopulations, a challenging yet intriguing research direction for the 
future. 
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Supplement A: Appendices (DOI: 10.1214/09-AOAS297SUPPA; .pdf). 
The pdf file contains biological background, experimental design issues, 
Markov chain Monte Carlo (MCMC) procedures and likelihood analyses 
for special cases. 

Supplement B: Data and MCMC code (DOI: 10.1214/09-AOAS297SUPPB; 
.zip). The zip file contains the FMR1 data analyzed in this paper, the R 
code that implements the MCMC procedure and MCMC outputs summa- 
rized and displayed in Section 3. 
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